As a follow-up to a previous post about correlations between Texas high school academic UIL competition scores and SAT/ACT scores, I wanted explore some of the “alternatives” to joining the two data sets—which come from different sources. In that post, I simply perform a an inner_join() using the school and city names as keys. While this decision ensures that the data integrity is “high”, there are potentially many un-matched schools that could have been included in the analysis with some sound “fuzzy matching”. For my exploration here, I’ll leverage David Robinson’s excellent {fuzzyjoin} package.
My dissatisfaction with the results that I came up with various fuzzy matching attempts ultimately led me to “settle” upon
In its raw form,the UIL data—scraped from https://www.hpscience.net/ (which was the only site that I found to have the academic competition scores)—is much “more unclean” than the data from the SAT and ACT data from the TEA website. In particular, many of the city, schoo, and individual participant names are inconsistent across different webpages. 1 Even after eliminating much of the “self-inconsistency” of the UIL data and creating a suitable “clean” data set to use as the basis for my series of posts exploring the UIL data exclusively, there are more than a few differences in the names of schools between the UIL and TEA data.
To aid in my experimentation with “fuzzy joining”, I use a function from my {tetidy} package–join_fuzzily(), which is essentially a wrapper for the stringdist_*_join() functions provided by the {fuzzyjoin} package. With this function, I was able to evaluate different values for max_dist and different join columns to help me make a judgement regarding the quality of joins. I primarily considered counts of joined and unjoined rows computed with the summarise_join_stats() function—also from my {tetidy} package—to make a decision on how I should join the UIL and SAT/ACT school data. 2
What follows is the results of some of my experimentation. Of course, let’s first have a look at the data sets that we’re working with. (See my other posts to see how I created these data sets.)
schools_tea
schools_uil
For my first attempt at fuzzy matching, I tried joining using only school as a key column and setting max_dist = 1. (Note that I do a “full” join because my tetidy::summarise_join_stats() function works best with this kind of input.)
library("tidyverse")
schools_uil_distinct <-
schools_uil %>%
distinct(school, city)
summ_schools_joinfuzzy_1 <-
schools_tea %>%
tetidy::join_fuzzily(
schools_uil_distinct,
mode = "full",
max_dist = 1,
cols_join = c("school"),
suffix_x = "_tea",
suffix_y = "_uil"
) %>%
tetidy::summarise_join_stats(school_uil, school_tea) %>%
select(-x, -y) %>%
gather(metric, value)
summ_schools_joinfuzzy_1
| metric | value |
|---|---|
| n_x | 11,402 |
| n_y | 16,087 |
| n_joined | 11,153 |
| n_x_unjoined | 4,934 |
| n_y_unjoined | 249 |
| x_in_y_pct | 98 |
| y_in_x_pct | 69 |
This kind of join represents a very “optimistic” or “over-zealous” implementation that likely results in matches that are not “true”. Nonetheless, it’s hard to really conclude anything about the quality of the join without comparing it to the results of another attempt.
Next, I tried the same join, only setting max_dist = 0 this time. Note that this is really like “exact” string matching, meaning that there is not really any fuzzy matching going on. (Nonetheless, it was worthwhile to evaluate the counts of matches and mis-matches with a full join.)
summ_schools_joinfuzzy_2 <-
schools_tea %>%
tetidy::join_fuzzily(
schools_uil_distinct,
mode = "full",
max_dist = 0,
cols_join = c("school"),
suffix_x = "_tea",
suffix_y = "_uil"
) %>%
tetidy::summarise_join_stats(school_uil, school_tea) %>%
select(-x, -y) %>%
gather(metric, value)
summ_schools_joinfuzzy_2
| metric | value |
|---|---|
| n_x | 10,755 |
| n_y | 15,491 |
| n_joined | 10,487 |
| n_x_unjoined | 5,004 |
| n_y_unjoined | 268 |
| x_in_y_pct | 98 |
| y_in_x_pct | 68 |
As we should expect, this results in a lower number of joins, but my feeling is that the join is still too naive as a result of using only one key column for joining—school.
What about changing the key columns to school and city and setting max_dist = 1? (Note that I unite() the two columns only for my {tetidy::summarise_join_stats() function, which can only work with a single “key” column. If only joining the data, the two join columns could be kept separate.)
summ_schools_joinfuzzy_3 <-
schools_tea %>%
unite(school_city, school, city, remove = FALSE) %>%
tetidy::join_fuzzily(
schools_uil_distinct %>%
unite(school_city, school, city, remove = FALSE),
mode = "full",
max_dist = 1,
cols_join = c("school_city"),
suffix_x = "_tea",
suffix_y = "_uil"
) %>%
tetidy::summarise_join_stats(school_city_uil, school_city_tea) %>%
select(-x, -y) %>%
gather(metric, value)
summ_schools_joinfuzzy_3
| metric | value |
|---|---|
| n_x | 2,694 |
| n_y | 13,814 |
| n_joined | 1,549 |
| n_x_unjoined | 12,265 |
| n_y_unjoined | 1,145 |
| x_in_y_pct | 57 |
| y_in_x_pct | 11 |
This is probably the best option of the many alternatives that I tested—including many not shown here that try different columns for joining and increasing the value of max_dist. In fact, the number of rows that would be returned with an inner join and the same settings (i.e. max_dist = 1 and the key columns)—as indicated by the value n_joined—is only slightly greater than that which you get when you combine the two data sets with an inner_join() on school and city, as I ended up doing. 3
In some ways, I think this is not too unlike the classic “bias vs. variance” trade-off with machine learning. To what extent do you try to minimize error with your training data—and potentially overfit your model? Or do you try to make a simpler model that generalizes better to a data set outside the training data—potentially creating a model that is “underfit” because it is too simplified? In this case, the question is about to what extent do you try to maximize the number of observations joined—perhaps creating some false matches that compromise the data and the results? And, from the other persepective, if you don’t include all of the data that is only not included due to poor matching, how can you guarante that your data “representative” are that your results are legitimate? It’s a tricky trade-off, as with many things in the realm of data analysis. And, as with most things, the best solution depends heavily on the context. In this case, I think erring on the side of caution with a “conservative” joining policy is a logical choice because there are lots of rows that are matched, and one can only assume that there is no underlying “trend” or “bias” in those schools that are mismatched. (i.e. The un-matched schools are just as ikely to be superior/inferior academically relative to all other schools as those that are matched.)
Something that I did not consider here but is certainly worthy of exploration is neural network text classificaton. With this approach, I could quantify the probability that one string is “equivalent” to another, and choose matches with the highest probabilities for any given string. I have a feeling that this kind of approach would be more successful, although it does not seem as easy to implement. (I may come back and visit this idea in a future post.)
In fact, the rigor involved in cleaning the UIL data obliged be to completely hide it from the reader in my write-ups on the topic.↩
Check out my post on the “DRY” principle and its application to R packages. Creating packages for actions that you perform across projects is a real time-saver!↩
Actually, I also joined on year in the previous post (in order to make sure that school scores across years were “aligned” properly.↩